Take Home Assignment - Origin Financial
Overview
This report has the goal of show all process developed to analyze the data challenge proposed by the Origin Financial team. It’s organized in four main sessions:
The first session will be Engineering Raw Data where I’ll describe all the steps needed to clean raw data. The following step, Exploratory Data Analysis has some descriptive analysis about the data. At this point I’ll guide my analysis to data clustering based on RFM (Recency Frequency and Monetary). After that I’ll show the clustering analysis at session Clustering and some insights based on it. Finally, I have some notes and future ideas further some conclusion about the challenge.
Engineering Raw Data
Before starting the data analysis, let’s do a quickly review at available data sets and raise needs of pre-engineering.
Customers
## Rows: 803
## Columns: 7
## $ id <chr> "131d8363-e2a6-4c60-acef-da03e6bfbdc4", "831216da-744b-4…
## $ created_at <chr> "2020-03-11 19:38:35", "2020-05-01 13:09:00", "2020-03-1…
## $ date_of_birth <chr> "1990-08-16", "1933-11-22", "1986-02-02", "1990-01-01", …
## $ gender <chr> "male", "", "", "male", "", "", "", "male", "male", "mal…
## $ country <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", "U…
## $ state <chr> "CA", "WA", "NY", "CA", "", "CA", "", "CA", "NY", "CA", …
## $ city <chr> "San Francisco", "Tonasket", "New York", "San Francisco"…
In a quick overview, it’s possibly to note that variable created_at and date_of_birth should be modified to timestamp and date, respectively. Also, we can note that all missing data are represented by empty character space and it will be replaced by NA.
The variables state and city has some strange characters like this: <img src='#' onerror=alert('xss') /> I presume that this behavior is caused by a css operator and it’ll be replaced by NA.
The variable id it’s lowercase, while in the transactions data it’s upper-cased, so it’ll be replaced by uppercase.
Transactions
## Rows: 11,059
## Columns: 17
## $ X_id <chr> "4D2119A1-03D9-48F2-99B7-FEDD…
## $ user_id <chr> "94102846-0B6D-45D8-AFD1-DA80…
## $ account_id <chr> "88F8D694-AFF0-4CFC-BBB6-C89D…
## $ account_name <chr> "Chase - Plaid Money Market (…
## $ description <chr> "ACH Electronic CreditGUSTO P…
## $ type <chr> "expense", "expense", "expens…
## $ amount <dbl> -5850, -5850, -5850, -5850, -…
## $ date <chr> "2021-06-08T00:00:00.000Z", "…
## $ extra_fields.category.0 <chr> "Transfer", "Transfer", "Tran…
## $ extra_fields.category.1 <chr> "Debit", "Debit", "Debit", "D…
## $ extra_fields.category.2 <chr> "", "", "", "", "", "", "", "…
## $ extra_fields.category_id <int> 21006000, 21006000, 21006000,…
## $ extra_fields.merchant_name <chr> "", "", "", "", "", "", "", "…
## $ extra_fields.name <chr> "ACH Electronic CreditGUSTO P…
## $ extra_fields.payment_channel <chr> "other", "other", "other", "o…
## $ extra_fields.payment_meta.payment_method <chr> "ACH", "ACH", "ACH", "ACH", "…
## $ created_at <chr> "2021-06-18T19:14:32.075Z", "…
The transaction data doesn’t so much engineering to do. It’s just change timestamp and date types and replace empty spaces by NA. I also, replaced . by _ in some variable names.
Exploratory Data Analysis
The exploratory data analysis it’s divides
| Name | db_users |
| Number of rows | 803 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| Date | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 36 | 36 | 0 | 803 | 0 |
| created_at | 0 | 1.00 | 2414 | 5633 | 0 | 803 | 0 |
| gender | 751 | 0.06 | 4 | 10 | 0 | 4 | 0 |
| country | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| state | 179 | 0.78 | 2 | 2 | 0 | 27 | 0 |
| city | 176 | 0.78 | 4 | 21 | 0 | 120 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date_of_birth | 173 | 0.78 | 1000-01-01 | 2001-01-01 | 1989-12-12 | 119 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age_at | 173 | 0.78 | 33.56 | 40.21 | 19.41 | 29.63 | 31.08 | 33.99 | 1020.33 | ▇▁▁▁▁ |
At this simple description about the data set it’s possibly to note that age has a outlier that will be replaced by NA. Also, the great number of missing data in the variable gender, turning it almost useless in the data analysis.
Univariate Analysis
Age
The customers age distribution are concentraded between the range 30 - 50 years as shown bellow:
Maybe this variable could bea good one to use in the clustering process.
State
The geographical distribuiton of customer are really dense in the California State (31%) followed by MD(11.1%) add NC/NY tied with (7.8%). It’s important to note that more than 22% of the customers has no state information.
Users Transactions
Looking to the user transactions, I note a tiny number of different customers, just 84. Some of them (almost 20%) has more than one account, but as we expecte to analysi customer behavior over account behaviour I’ll treat them as the same customer.
| Number of Accounts | Number of Customers | % of Customers |
|---|---|---|
| 1 | 70 | 83.33 |
| 2 | 7 | 8.33 |
| 3 | 1 | 1.19 |
| 4 | 2 | 2.38 |
| 5 | 2 | 2.38 |
| 6 | 1 | 1.19 |
| 7 | 1 | 1.19 |
At this point it’s really important to say that just 84 customers to do the cluster analysis could be a problem for a bunch of reasons as like:
Some algorithms won’t provide robust results under small samples
The cluster interpretation could be a little messy.
The pattern recognition it’s hard to do
Sad that I’ll develop my analysis and won’t
Type of transacions
Analysing the type of transactions we can note that has a mix of too different transactions. Income, Expense and Transfer, as I’ll show in the next steps, it will be treated separatelly to do the clustering analysis.
The amount money transactions types has a strange behaviour having low variety of values. Transfer for example has just the value 25. It’s something that could affect the behavior of clusterig analysis.
Looking to number and amount of transactions per day by expenses, transfer and income it’s possibly to note that the days 2021-06-15 (income and expense) and 2021-06-16 (income) has a lot more transactions compare to other days. Expense category it’s almost 10x the median for the period and for income it’s mre than 10x.
I don’t know if this strange behavior comes from a selection bias or if this is from the data. I’ll presume that this behavior is a normal one.
Transactions Category
Looking the transactions category we can note that we have a lot of missing ones, but part of that is caused by income transactions.
| extra_fields_category_0 | extra_fields_category_1 | extra_fields_category_2 | extra_fields_category_id | n | perc |
|---|---|---|---|---|---|
| NA | NA | NA | NA | 2742 | 24.79 |
| Travel | Taxi | NA | 22016000 | 1230 | 11.12 |
| Food and Drink | Restaurants | Fast Food | 13005032 | 1057 | 9.56 |
| Travel | Airlines and Aviation Services | NA | 22001000 | 1055 | 9.54 |
| Food and Drink | Restaurants | NA | 13005000 | 1053 | 9.52 |
| Payment | Credit Card | NA | 16001000 | 623 | 5.63 |
| Food and Drink | Restaurants | Coffee Shop | 13005043 | 614 | 5.55 |
| Transfer | Credit | NA | 21005000 | 607 | 5.49 |
| Payment | NA | NA | 16000000 | 443 | 4.01 |
| Recreation | Gyms and Fitness Centers | NA | 17018000 | 443 | 4.01 |
| Shops | Sporting Goods | NA | 19046000 | 443 | 4.01 |
| Transfer | Debit | NA | 21006000 | 425 | 3.84 |
| Transfer | Deposit | NA | 21007000 | 324 | 2.93 |
Almost 20% of transactions came from travel category, divided into taxi (Uber) and Airlines and Aviation.
Food and drink represent 25% divided in restaurants like coffe shop and fast foods.
Also has some transactions repesented by recreation and shops ans some transfer between accounts
Merchant Name
Looking deeper to the expense types and merchant names, we can see at the top Uber, McDonalds, starbaucks and so on.
Channel Transactions
The preferencial channel to expenses is in store(73%). Would be interesting if we had some information abou online channels as app payment, website shop e etc.
Payment Methods
By payment methods we have just 4% of transactions with this data, it’s almost useless to the analysis.
RFM Transformation
After this exploratory analysis about variables and looking to a solution to clustering problem . I decided transform the transactional data into recency frequency and monetary value for the tree types of transactions: Expense, Income and Transfer.
After this transformation our dataset looks like this:
## Rows: 84
## Columns: 33
## $ id <chr> "AD7226ED-2D26-45FD-AB37-C3823…
## $ state <chr> "NY", "CA", "IL", "NY", NA, "V…
## $ age_at <dbl> 34.10458, 32.52759, 33.34352, …
## $ count_transactions_expense <dbl> 121, 120, 144, 538, 270, 0, 0,…
## $ amount_expense <dbl> 2825.37, 15654.90, 18785.88, 2…
## $ amount_mean_expense <dbl> 23.35017, 130.45750, 130.45750…
## $ count_transactions_income <dbl> 24, 210, 252, 50, 24, 6, 6, 6,…
## $ amount_income <dbl> 12000.00, 585798.60, 702958.32…
## $ amount_mean_income <dbl> 5.000000e+02, 2.789517e+03, 2.…
## $ count_transactions_transfer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ amount_transfer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ amount_mean_transfer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ n_transactions_transfer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ n_transactions_food_and_drink <dbl> 72, 0, 0, 246, 122, 0, 0, 0, 0…
## $ n_transactions_payment <dbl> 0, 0, 0, 48, 25, 0, 0, 0, 0, 0…
## $ n_transactions_recreation <dbl> 0, 0, 0, 50, 24, 0, 0, 0, 0, 0…
## $ n_transactions_shops <dbl> 0, 0, 0, 48, 25, 0, 0, 0, 0, 0…
## $ n_transactions_travel <dbl> 49, 0, 0, 146, 74, 0, 0, 0, 0,…
## $ transactions_amount_transfer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ transactions_amount_food_and_drink <dbl> 2537.52, 0.00, 0.00, 53286.50,…
## $ transactions_amount_payment <dbl> 0.0, 0.0, 0.0, 99768.0, 51962.…
## $ transactions_amount_recreation <dbl> 0, 0, 0, 3925, 1884, 0, 0, 0, …
## $ transactions_amount_shops <dbl> 0, 0, 0, 24000, 12500, 0, 0, 0…
## $ transactions_amount_travel <dbl> 287.85, 0.00, 0.00, 24573.84, …
## $ transactions_amount_mean_transfer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ transactions_amount_mean_food_and_drink <dbl> 35.24333, 0.00000, 0.00000, 21…
## $ transactions_amount_mean_payment <dbl> 0.0, 0.0, 0.0, 2078.5, 2078.5,…
## $ transactions_amount_mean_recreation <dbl> 0.0, 0.0, 0.0, 78.5, 78.5, 0.0…
## $ transactions_amount_mean_shops <dbl> 0, 0, 0, 500, 500, 0, 0, 0, 0,…
## $ transactions_amount_mean_travel <dbl> 5.87449, 0.00000, 0.00000, 168…
## $ no_channel <dbl> 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, …
## $ other <dbl> 0.1655172, 0.0000000, 0.000000…
## $ in_store <dbl> 0.8344828, 0.0000000, 0.000000…
Just 84 data points (customers) and 33 variables. I drop the recency variables because I think that isn’t the goal of the analysis explore “how much time the customer doesn’t have any transaction”.
Expense vs Income
Firstly, analyzing the behavior of the amount mean of transactions in expenses and income we can note some behaviors.
Note that some customer has high income mean and low expense mean, while we have others that expend more than earn. This is a good spoiler about the clustering.
It’s important to note the outlier that expend much more than earn, I will keep this customer in analysis because it’s a different behavior that could be more representative if we have more data.
Expense Vs Income Vs Age
Adding the variable age to our analysis, it’s possibly to note that it doesn’t affect the behavior of income and expense, so this variable wouldn’t be useful in the cluster analysis.
Expense vs Income vs Transfer
Transfer mean amount it’s another variable that doesn’t affect the income and expense. As shown bellow:
Relationship between categorys of expenses
Looking to the correlation between category expenses it’s possibly to note that every transaction has a high correlation. This analysis isn’t so conclusive because we have just 13 customers with expenses categories.
Also, I will try to do a cluster with this datapoints, the main ideia is to show my way of thinking the problem than try to interpret something. Because just 13 customer isn’t enough data to any analysis.
Clustering
At the clustering analysis my goal is to made two type of clusters, one based on incomes and expenses and another one based on the categories of expense. I think this way could give us some conclusions about the data.
It’s important to say that I use just one method of clustering called kmeans, it’s based on centroid distances. This choice was made because the data sets has few data points and I would like to maintain the analysis as simple as possibly.
At the beginning of every clustering I will do a analysis to determine the optimal number of cluster, after that the cluster analysis will be made and also insights about the clusters responses.
To determine de optimal number of cluster I run tree methods:
Elbow method - The idea behind this method is compute the within-cluster variation to different number of clusters and minimize it.
Silhouette Method - It measures the quality of clustering based on the average similarity of each point to to its own cluster compared to other clusters.
GAP method - The gap statistic compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data (i.e. a distribution with no obvious clustering).
Type Transactions
Expense vs Income
We can visualize the results of methods to determine the optimal number of clusters. Two of them indicates the best number of cluster should be 6 and the elbow method indicates 4 clusters.
Using 4 clusters we can have some insights about the customers:
1 - Customers with no income and no expense
2 - Customers with medium income and medium expense
3 - Customers with high income and low expense
4 - Customers with no income and high expense
| Type Clustering | Cluster | Number Customers |
|---|---|---|
| cluster_4 | 1 | 1 |
| cluster_4 | 2 | 9 |
| cluster_4 | 3 | 5 |
| cluster_4 | 4 | 69 |
| cluster_6 | 1 | 4 |
| cluster_6 | 2 | 5 |
| cluster_6 | 3 | 1 |
| cluster_6 | 4 | 66 |
| cluster_6 | 5 | 5 |
| cluster_6 | 6 | 3 |
Note that using 6 clusters we have few customers in some clusters and it can cause the suggestion of the methods, because the distance between members of cluster has more chance to be tiny.
Expense vs Income vs Freq Expense
Adding the frequency of expenses we can visualize better the clusters.
The methods indicates 5 or 6 clusters, I used 5 in the analysis:
Note the interpretation about the clusters:
1 - Customers with medium income, lot of expense transactions and medium amount average expense;
2 - Customers with no amount expense;
3 - Customers with medium income, medium transactions expense and medium amount expense;
4 - Outliers, the customer that has really high amount expense in few transactions and has no income;
5 - Customer with high income and low average amount expense with medium transactions.
At this point we have some good interpretation about users transactions. Let’s see about expenses categories.
Category Expense
We have just 18 customers with expenses, so it’s hard to extract some helpful interpretation or relationship in the data. But let’s look to clustering methods.
At this point I use the relative amount mean expend to the total amount. The same analogy to the frequency.
Amount Type Expense
The methods indicates just 2 clusters relatives to the amount mean expense in each category.
| Cluster | Transfer | Food and Drink | Payment | Recreation | Shops | Travel |
|---|---|---|---|---|---|---|
| 1 | 0.09 | 0.30 | 0.18 | 0.01 | 0.04 | 0.07 |
| 2 | 0.69 | 0.08 | 0.15 | 0.01 | 0.04 | 0.04 |
The only difference between groups looks that the Group 2 expends more with transfers and the group one with food and drink. But nothing too conclusive.
Frequency Type Expense
The methods indicates just 2 clusters relatives to the amount mean expense in each category.
| Cluster | Transfer | Food and Drink% | Payment | Recreation | Shops | Travel |
|---|---|---|---|---|---|---|
| 1 | 0.09 | 0.31 | 0.04 | 0.04 | 0.04 | 0.20 |
| 2 | 0.15 | 0.37 | 0.07 | 0.07 | 0.07 | 0.22 |
There aren’t any conclusion about the groups, the expenses are really similar.
Conclusion
Our goal with this analysis was to find some group of customers that has similar behavior in the transaction data. I led the analysis using the approach o frequency and monetary value expenses, incomes and transfers. Initially we had 803 customers to cluster, after the join between data sets left just 84 customer with transaction.
Based on it, I achieve some good results in clustering customer using income and expenses. A couple of behaviors were detected. Unfortunately using the expenses data there was no conclusion about the groups, it’s important stand out that we had just 18 customers, something that makes hard to reach a closure.
Notes
This take home assignment was very challenging because it’s a open problem and I could have a lot of different ways to do the analysis, My choice was explore the RFM customer behavior and at the end just FM.
I choose to explore just one method of clustering to made things simple. But in a work day problem I would test some other method like hierarchical clustering and principally density based clustering like Dbscan and Hdbscan, because this methods has the advantage of not classifying some data points with the idea of “Not all customers belongs a cluster”.
I would like to thank the Origin team for dedicate time to elaborate this problem and give me the chance show my work.